INTRODUCTION

The purpose of this project is to carry out a comprehensive analysis of a provided wine data set. The work applies appropriate statistical methods to the data and includes interpretation and discussion of the data, graphs, tables, and results. The following areas were covered during the project work:
1. Importing the wine data into R
2. Reviewing the data in the text file
3. Cleaning the data
4. Exploring the data through visualization
5. Drawing insights from the data

Relevant Information: These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. The attributes are:
1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline

Number of instances of each value of the variable Class:
class 1: 59
class 2: 71
class 3: 48

There are 13 predictor variables and 1 target variable. The data set contains 18 missing values and 1 misleading value, which was treated as an outlier.

Loading the libraries

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggrepel)

library(ggplot2)

library(corrplot)
## corrplot 0.92 loaded
library(tidyr)

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(olsrr)
## 
## Attaching package: 'olsrr'
## 
## The following object is masked from 'package:MASS':
## 
##     cement
## 
## The following object is masked from 'package:datasets':
## 
##     rivers
library(stats)


library(dplyr)

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(psych)
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

Importing the wine data into R

# Load data into R
initial_data <- read.table("wine.txt", sep = ",", header = FALSE)
initial_data
##     V1    V2   V3   V4   V5    V6   V7   V8   V9  V10        V11   V12  V13
## 1    1 14.23 1.71   NA 15.6   127 2.80 3.06 0.28 2.29       5.64 1.040 3.92
## 2    1 13.20 1.78 2.14 11.2   100 2.65 2.76 0.26 1.28       4.38 1.050 3.40
## 3    1 13.16 2.36 2.67 18.6   101 2.80 3.24 0.30 2.81       5.68 1.030 3.17
## 4    1 14.37 1.95 2.50 16.8   113 3.85 3.49 0.24 2.18       7.80 0.860 3.45
## 5    1 13.24 2.59 2.87   21   118 2.80 2.69 0.39 1.82       4.32 1.040 2.93
## 6    1 14.20 1.76 2.45 15.2   112 3.27 3.39 0.34 1.97       6.75 1.050 2.85
## 7    1 14.39 1.87 2.45 14.6    96 2.50 2.52 0.30 1.98       5.25 1.020 3.58
## 8    1 14.06 2.15 2.61 17.6   121 2.60 2.51 0.31 1.25       5.05 1.060 3.58
## 9    1 14.83 1.64 2.17   14    97 2.80 2.98 0.29 1.98       5.20 1.080 2.85
## 10   1 13.86 1.35 2.27   16    98 2.98 3.15 0.22 1.85       7.22 1.010 3.55
## 11   1 14.10 2.16 2.30   18   105 2.95 3.32 0.22 2.38       5.75 1.250 3.17
## 12   1 14.12 1.48 2.32 16.8    95 2.20 2.43 0.26 1.57       5.00 1.170 2.82
## 13   1 13.75 1.73 2.41   16    89 2.60 2.76 0.29 1.81       5.60 1.150 2.90
## 14   1 14.75 1.73 2.39 11.4    91 3.10 3.69 0.43 2.81       5.40 1.250 2.73
## 15   1 14.38 1.87 2.38   12   102 3.30 3.64 0.29 2.96       7.50 1.200 3.00
## 16   1 13.63 1.81 2.70 17.2   112 2.85 2.91 0.30 1.46       7.30 1.280 2.88
## 17   1 14.30 1.92 2.72   20   120 2.80 3.14 0.33 1.97       6.20 1.070 2.65
## 18   1 13.83 1.57 2.62   20   115 2.95 3.40 0.40 1.72       6.60 1.130 2.57
## 19   1 14.19 1.59 2.48 16.5   108 3.30 3.93 0.32 1.86       8.70 1.230 2.82
## 20   1 13.64  3.1 2.56 15.2   116 2.70 3.03 0.17 1.66       5.10 0.960 3.36
## 21   1 14.06 1.63 2.28   16   126 3.00 3.17 0.24  2.1       5.65 1.090 3.71
## 22   1 12.93  3.8 2.65 18.6     . 2.41 2.41 0.25 1.98       4.50 1.030 3.52
## 23   1 13.71 1.86 2.36 16.6   101 2.61 2.88 0.27 1.69       3.80 1.110 4.00
## 24   1 12.85  1.6 2.52 17.8    95   NA 2.37 0.26 1.46       3.93 1.090 3.63
## 25   1 13.50 1.81 2.61   20    96 2.53 2.61 0.28 1.66       3.52 1.120 3.82
## 26   1 13.05 2.05 3.22   25   124 2.63 2.68 0.47 1.92       3.58 1.130 3.20
## 27   1 13.39 1.77 2.62 16.1    93 2.85 2.94 0.34 1.45       4.80 0.920   NA
## 28   1 13.30 1.72 2.14   17    94 2.40 2.19 0.27 1.35       3.95 1.020 2.77
## 29   1 13.87  1.9 2.80 19.4   107 2.95 2.97 0.37 1.76       4.50 1.250 3.40
## 30   1 14.02 1.68 2.21   16    96 2.65 2.33 0.26 1.98       4.70 1.040 3.59
## 31   1 13.73  1.5 2.70 22.5   101 3.00 3.25 0.29 2.38       5.70 1.190 2.71
## 32   1 13.58 1.66 2.36 19.1   106 2.86 3.19 0.22 1.95       6.90 1.090 2.88
## 33   1 13.68 1.83 2.36 17.2   104 2.42 2.69 0.42 1.97       3.84 1.230 2.87
## 34   1 13.76 1.53 2.70 19.5   132 2.95 2.74 0.50 1.35       5.40 1.250 3.00
## 35   1 13.51  1.8 2.65   19   110 2.35 2.53 0.29 1.54       4.20 1.100 2.87
## 36   1 13.48      2.41 NULL   100 2.70 2.98 0.26 1.86       5.10 1.040 3.47
## 37   1 13.28 1.64 2.84 15.5   110 2.60 2.68 0.34 1.36       4.60 1.090 2.78
## 38   1 13.05 1.65 2.55   18    98 2.45 2.43 0.29 1.44       4.25 1.120 2.51
## 39   1 13.07  1.5 2.10 15.5    98 2.40 2.64 0.28 1.37       3.70 1.180 2.69
## 40   1 14.22 3.99 2.51 13.2   128 3.00 3.04 0.20 2.08       5.10 0.890 3.53
## 41   1 13.56 1.71   NA 16.2   117 3.15   NA 0.34 2.34       6.13 0.950 3.38
## 42   1 13.41 3.84 2.12 18.8    90 2.45 2.68 0.27 1.48       4.28 0.910 3.00
## 43   1 13.88 1.89 2.59   15   101 3.25 3.56 0.17  1.7       5.43 0.880 3.56
## 44   1 13.24 3.98 2.29 17.5   103 2.64 2.63 0.32 1.66       4.36 0.820 3.00
## 45   1 13.05 1.77 2.10   17   107 3.00 3.00 0.28 2.03       5.04 0.880 3.35
## 46   1 14.21 4.04 2.44 18.9   111 2.85 2.65 0.30 1.25       5.24 0.870 3.33
## 47   1 14.38 3.59 2.28   16   102 3.25 3.17 0.27 2.19       4.90 1.040 3.44
## 48   1 13.90 1.68 2.12        101 3.10 3.39 0.21 2.14       6.10 0.910 3.33
## 49   1 14.10 2.02 2.40 18.8     . 2.75 2.92 0.32   na       6.20 1.070 2.75
## 50   1 13.94 1.73 2.27 17.4   108 2.88 3.54 0.32 2.08       8.90 1.120 3.10
## 51   1 13.05 1.73 2.04 12.4    92 2.72 3.27 0.17 2.91       7.20 1.120 2.91
## 52   1 13.83 1.65 2.60 17.2    94 2.45 2.99   NA 2.29       5.60 1.240 3.37
## 53   1 13.82    . 2.42   14   111 3.88 3.74 0.32 1.87       7.05 1.010 3.26
## 54   1 13.77  1.9   NA 17.1   115 3.00 2.79 0.39 1.68       6.30 1.130 2.93
## 55   1 13.74 1.67 2.25 16.4   118 2.60 2.90 0.21 1.62       5.85 0.920 3.20
## 56   1 13.56 1.73 2.46 20.5   116 2.96 2.78 0.20 2.45       6.25 0.980 3.03
## 57   1 14.22  1.7 2.30 16.3     . 3.20 3.00 0.26 2.03       6.38 0.940 3.31
## 58   1 13.29 1.97 2.68    .   102 3.00 3.23 0.31 1.66       6.00 1.070 2.84
## 59   1 13.72 1.43 2.50   na   108 3.40 3.67 0.19 2.04       6.80 0.890 2.87
## 60   2 12.37  .94 1.36   na    88 1.98 0.57 0.28  .42       1.95 1.050 1.82
## 61   2 12.33  1.1 2.28   16   101 2.05 1.09 0.63  .41       3.27 1.250 1.67
## 62   2 12.64 1.36 2.02 16.8   100 2.02 1.41 0.53  .62       5.75 0.980 1.59
## 63   2 13.67 1.25 1.92   18    94 2.10 1.79 0.32  .73       3.80 1.230 2.46
## 64   2 12.37 1.13 2.16   19    87 3.50 3.10 0.19 1.87       4.45 1.220 2.87
## 65   2 12.17 1.45 2.53   19   104 1.89 1.75 0.45 1.03       2.95 1.450 2.23
## 66   2 12.37 1.21 2.56 18.1    98 2.42 2.65 0.37 2.08       4.60 1.190 2.30
## 67   2 13.11 1.01 1.70   15    78 2.98 3.18 0.26 2.28       5.30 1.120 3.18
## 68   2 12.37 1.17 1.92 19.6    78 2.11 2.00 0.27 1.04       4.68 1.120 3.48
## 69   2 13.34  .94 2.36   17   110 2.53 1.30 0.55  .42       3.17 1.020 1.93
## 70   2 12.21 1.19 1.75 16.8   151 1.85 1.28 0.14  2.5       2.85 1.280 3.07
## 71   2 12.29 1.61 2.21 20.4   103 1.10 1.02 0.37 1.46       3.05 0.906 1.82
## 72   2 13.86 1.51 2.67   25    86 2.95 2.86 0.21 1.87       3.38 1.360 3.16
## 73   2 13.49 1.66 2.24   24    87 1.88 1.84 0.27 1.03       3.74 0.980 2.78
## 74   2 12.99 1.67 2.60   30   139 3.30 2.89 0.21 1.96       3.35 1.310 3.50
## 75   2 11.96 1.09 2.30   21   101 3.38 2.14 0.13 1.65       3.21 0.990 3.13
## 76   2 11.66 1.88 1.92   16    97 1.61 1.57 0.34 1.15       3.80 1.230 2.14
## 77   2 13.03   .9 1.71   16    86 1.95 2.03 0.24 1.46       4.60 1.190 2.48
## 78   2 11.84 2.89 2.23   18   112 1.72 1.32 0.43  .95       2.65 0.960 2.52
## 79   2 12.33  .99 1.95 14.8   136 1.90 1.85 0.35 2.76       3.40 1.060 2.31
## 80   2 12.70 3.87 2.40   23   101 2.83 2.55 0.43 1.95       2.57 1.190 3.13
## 81   2 12.00  .92 2.00   19    86 2.42 2.26 0.30 1.43       2.50 1.380 3.12
## 82   2 12.72 1.81 2.20 18.8    86 2.20 2.53 0.26 1.77       3.90 1.160 3.14
## 83   2 12.08 1.13 2.51   24    78 2.00 1.58 0.40  1.4       2.20 1.310 2.72
## 84   2 13.05 3.86 2.32 22.5    85 1.65 1.59 0.61 1.62       4.80 0.840 2.01
## 85   2 11.84  .89 2.58   18    94 2.20 2.21 0.22 2.35       3.05 0.790 3.08
## 86   2 12.67  .98 2.24   18    99 2.20 1.94 0.30 1.46       2.62 1.230 3.16
## 87   2 12.16 1.61 2.31 22.8    90 1.78 1.69 0.43 1.56       2.45 1.330 2.26
## 88   2 11.65 1.67 2.62   26    88 1.92 1.61 0.40 1.34       2.60 1.360 3.21
## 89   2 11.64 2.06 2.46 21.6    84 1.95 1.69 0.48 1.35       2.80 1.000 2.75
## 90   2 12.08 1.33 2.30 23.6    70 2.20 1.59 0.42 1.38       1.74 1.070 3.21
## 91   2 12.08 1.83 2.32 18.5    81 1.60 1.50 0.52 1.64       2.40 1.080 2.27
## 92   2 12.00 1.51 2.42   22    86 1.45 1.25 0.50 1.63       3.60 1.050 2.65
## 93   2 12.69 1.53 2.26 20.7    80 1.38 1.46 0.58 1.62       3.05 0.960 2.06
## 94   2 12.29 2.83 2.22   18    88 2.45 2.25 0.25 1.99       2.15 1.150 3.30
## 95   2 11.62 1.99 2.28   18    98 3.02 2.26 0.17 1.35       3.25 1.160 2.96
## 96   2 12.47 1.52 2.20   19   162 2.50 2.27 0.32 3.28       2.60 1.160 2.63
## 97   2 11.81 2.12 2.74 21.5   134 1.60 0.99 0.14 1.56       2.50 0.950 2.26
## 98   2 12.29 1.41 1.98   16    85 2.55 2.50 0.29 1.77       2.90 1.230 2.74
## 99   2 12.37 1.07 2.10 18.5    88 3.52 3.75 0.24 1.95       4.50 1.040 2.77
## 100  2 12.29 3.17 2.21   18    88 2.85 2.99 0.45 2.81       2.30 1.420 2.83
## 101  2 12.08 2.08 1.70 17.5    97 2.23 2.17 0.26  1.4       3.30 1.270 2.96
## 102  2 12.60 1.34 1.90 18.5    88 1.45 1.36 0.29 1.35       2.45 1.040 2.77
## 103  2 12.34 2.45 2.46   21    98 2.56 2.11 0.34 1.31       2.80 0.800 3.38
## 104  2 11.82 1.72 1.88 19.5    86 2.50 1.64 0.37 1.42       2.06 0.940 2.44
## 105  2 12.51 1.73 1.98 20.5    85 2.20 1.92 0.32 1.48       2.94 1.040 3.57
## 106  2 12.42 2.55 2.27   22    90 1.68 1.84 0.66 1.42       2.70 0.860 3.30
## 107  2 12.25 1.73 2.12   19    80 1.65 2.03 0.37 1.63       3.40 1.000 3.17
## 108  2 12.72 1.75 2.28 22.5    84 1.38 1.76 0.48 1.63       3.30 0.880 2.42
## 109  2 12.22 1.29 1.94   19    92 2.36 2.04 0.39 2.08       2.70 0.860 3.02
## 110  2 11.61 1.35 2.70   20    94 2.74 2.92 0.29 2.49       2.65 0.960 3.26
## 111  2 11.46 3.74 1.82 19.5   107 3.18 2.58 0.24 3.58       2.90 0.750 2.81
## 112  2 12.52 2.43 2.17   21    88 2.55 2.27 0.26 1.22       2.00 0.900 2.78
## 113  2 11.76 2.68 2.92   20   103 1.75 2.03 0.60 1.05       3.80 1.230 2.50
## 114  2 11.41  .74 2.50   21    88 2.48 2.01 0.42 1.44       3.08 1.100 2.31
## 115  2 12.08 1.39 2.50 22.5    84 2.56 2.29 0.43 1.04       2.90 0.930 3.19
## 116  2 11.03 1.51 2.20 21.5    85 2.46 2.17 0.52 2.01       1.90 1.710 2.87
## 117  2 11.82 1.47 1.99 20.8    86 1.98 1.60 0.30 1.53       1.95 0.950 3.33
## 118  2 12.42 1.61 2.19 22.5   108 2.00 2.09 0.34 1.61       2.06 1.060 2.96
## 119  2 12.77 3.43 1.98   16    80 1.63 1.25 0.43  .83       3.40 0.700 2.12
## 120  2 12.00 3.43 2.00   19    87 2.00 1.64 0.37 1.87       1.28 0.930 3.05
## 121  2 11.45  2.4 2.42   20    96 2.90 2.79 0.32 1.83       3.25 0.800 3.39
## 122  2 11.56 2.05 3.23 28.5   119 3.18 5.08 0.47 1.87       6.00 0.930 3.69
## 123  2 12.42 4.43 2.73 26.5   102 2.20 2.13 0.43 1.71       2.08 0.920 3.12
## 124  2 13.05  5.8 2.13 21.5    86 2.62 2.65 0.30 2.01       2.60 0.730 3.10
## 125  2 11.87 4.31 2.39   21    82 2.86 3.03 0.21 2.91       2.80 0.750 3.64
## 126  2 12.07 2.16 2.17   21    85 2.60 2.65 0.37 1.35       2.76 0.860 3.28
## 127  2 12.43 1.53 2.29 21.5    86 2.74 3.15 0.39 1.77       3.94 0.690 2.84
## 128  2 11.79 2.13 2.78 28.5    92 2.13 2.24 0.58 1.76       3.00 0.970 2.44
## 129  2 12.37 1.63 2.30 24.5    88 2.22 2.45 0.40  1.9       2.12 0.890 2.78
## 130  2 12.04  4.3 2.38   22    80 2.10 1.75 0.42 1.35       2.60 0.790 2.57
## 131  3 12.86 1.35 2.32   18   122 1.51 1.25 0.21  .94       4.10 0.760 1.29
## 132  3 12.88 2.99 2.40   20   104 1.30 1.22 0.24  .83       5.40 0.740 1.42
## 133  3 12.81 2.31 2.40   24    98 1.15 1.09 0.27  .83       5.70 0.660 1.36
## 134  3 12.70 3.55 2.36 21.5   106 1.70 1.20 0.17  .84       5.00 0.780 1.29
## 135  3 12.51 1.24 2.25 17.5    85 2.00 0.58 0.60 1.25       5.45 0.750 1.51
## 136  3 12.60 2.46 2.20 18.5    94 1.62 0.66 0.63  .94       7.10 0.730 1.58
## 137  3 12.25 4.72 2.54   21    89 1.38 0.47 0.53   .8       3.85 0.750 1.27
## 138  3 12.53 5.51 2.64   25    96 1.79 0.60 0.63  1.1       5.00 0.820 1.69
## 139  3 13.49 3.59 2.19 19.5    88 1.62 0.48 0.58  .88       5.70 0.810 1.82
## 140  3 12.84 2.96 2.61   24   101 2.32 0.60 0.53  .81       4.92 0.890 2.15
## 141  3 12.93 2.81 2.70   21    96 1.54 0.50 0.53  .75       4.60 0.770 2.31
## 142  3 13.36 2.56 2.35   20    89 1.40 0.50 0.37  .64       5.60 0.700 2.47
## 143  3 13.52 3.17 2.72 23.5    97 1.55 0.52 0.50  .55       4.35 0.890 2.06
## 144  3 13.62 4.95 2.35   20    92 2.00 0.80 0.47 1.02       4.40 0.910 2.05
## 145  3 12.25 3.88 2.20 18.5   112 1.38 0.78 0.29 1.14       8.21 0.650 2.00
## 146  3 13.16 3.57 2.15   21   102 1.50 0.55 0.43  1.3       4.00 0.600 1.68
## 147  3 13.88 5.04 2.23   20    80 0.98 0.34 0.40  .68       4.90 0.580 1.33
## 148  3 12.87 4.61 2.48 21.5    86 1.70 0.65 0.47  .86       7.65 0.540 1.86
## 149  3 13.32 3.24 2.38 21.5    92 1.93 0.76 0.45 1.25       8.42 0.550 1.62
## 150  3 13.08  3.9 2.36 21.5   113 1.41 1.39 0.34 1.14       9.40 0.570 1.33
## 151  3 13.50 3.12 2.62   24   123 1.40 1.57 0.22 1.25       8.60 0.590 1.30
## 152  3 12.79 2.67 2.48   22   112 1.48 1.36 0.24 1.26      10.80 0.480 1.47
## 153  3 13.11  1.9 2.75 25.5   116 2.20 1.28 0.26 1.56       7.10 0.610 1.33
## 154  3 13.23  3.3 2.28 18.5    98 1.80 0.83 0.61 1.87      10.52 0.560 1.51
## 155  3 12.58 1.29 2.10   20   103 1.48 0.58 0.53  1.4       7.60 0.580 1.55
## 156  3 13.17 5.19 2.32   22    93 1.74 0.63 0.61 1.55       7.90 0.600 1.48
## 157  3 13.84 4.12 2.38 19.5    89 1.80 0.83 0.48 1.56       9.01 0.570 1.64
## 158  3 12.45 3.03 2.64   27    97 1.90 0.58 0.63 1.14       7.50 0.670 1.73
## 159  3 14.34 1.68 2.70   25    98 2.80 1.31 0.53  2.7      13.00 0.570 1.96
## 160  3 13.48 1.67 2.64 22.5    89 2.60 1.10 0.52 2.29      11.75 0.570 1.78
## 161  3 12.36 3.83 2.38   21    88 2.30 0.92 0.50 1.04       7.65 0.560 1.58
## 162  3 13.69 3.26 2.54   20   107 1.83 0.56 0.50   .8       5.88 0.960 1.82
## 163  3 12.85 3.27 2.58   22   106 1.65 0.60 0.60  .96       5.58 0.870 2.11
## 164  3 12.96 3.45 2.35 18.5   106 1.39 0.70 0.40  .94       5.28 0.680 1.75
## 165  3 13.78 2.76 2.30   22    90 1.35 0.68 0.41 1.03       9.58 0.700 1.68
## 166  3 13.73 4.36 2.26 22.5    88 1.28 0.47 0.52 1.15       6.62 0.780 1.75
## 167  3 13.45  3.7 2.60   23   111 1.70 0.92 0.43 1.46      10.68 0.850 1.56
## 168  3 12.82 3.37 2.30 19.5    88 1.48 0.66 0.40  .97      10.26 0.720 1.75
## 169  3 13.58 2.58 2.69 24.5   105 1.55 0.84 0.39 1.54       8.66 0.740 1.80
## 170  3 13.40  4.6 2.86   25   112 1.98 0.96 0.27 1.11       8.50 0.670 1.92
## 171  3 12.20 3.03 2.32   19    96 1.25 0.49 0.40  .73       5.50 0.660 1.83
## 172  3 12.77 2.39 2.28 19.5    86 1.39 0.51 0.48  .64 9899999.00 0.570 1.63
## 173  3 14.16 2.51 2.48   20    91 1.68 0.70 0.44 1.24       9.70 0.620 1.71
## 174  3 13.71 5.65 2.45 20.5    95 1.68 0.61 0.52 1.06       7.70 0.640 1.74
## 175  3 13.40 3.91 2.48   23   102 1.80 0.75 0.43 1.41       7.30 0.700 1.56
## 176  3 13.27 4.28 2.26   20   120 1.59 0.69 0.43 1.35      10.20 0.590 1.56
## 177  3 13.17 2.59 2.37   20 99999 1.65 0.68 0.53 1.46       9.30 0.600 1.62
## 178  3 14.13  4.1 2.74 24.5    96 2.05 0.76 0.56 1.35       9.20 0.610 1.60
##      V14
## 1   1065
## 2   1050
## 3   1185
## 4   1480
## 5    735
## 6   1450
## 7   1290
## 8   1295
## 9   1045
## 10  1045
## 11  1510
## 12  1280
## 13  1320
## 14  1150
## 15  1547
## 16  1310
## 17  1280
## 18  1130
## 19  1680
## 20   845
## 21   780
## 22   770
## 23  1035
## 24  1015
## 25   845
## 26   830
## 27  1195
## 28  1285
## 29   915
## 30  1035
## 31  1285
## 32  1515
## 33   990
## 34  1235
## 35  1095
## 36   920
## 37   880
## 38  1105
## 39  1020
## 40   760
## 41   795
## 42  1035
## 43  1095
## 44   680
## 45   885
## 46  1080
## 47  1065
## 48   985
## 49  1060
## 50  1260
## 51  1150
## 52  1265
## 53  1190
## 54  1375
## 55  1060
## 56  1120
## 57   970
## 58  1270
## 59  1285
## 60   520
## 61   680
## 62   450
## 63   630
## 64   420
## 65   355
## 66   678
## 67   502
## 68   510
## 69   750
## 70   718
## 71   870
## 72   410
## 73   472
## 74   985
## 75   886
## 76   428
## 77   392
## 78   500
## 79   750
## 80   463
## 81   278
## 82   714
## 83   630
## 84   515
## 85   520
## 86   450
## 87   495
## 88   562
## 89   680
## 90   625
## 91   480
## 92   450
## 93   495
## 94   290
## 95   345
## 96   937
## 97   625
## 98   428
## 99   660
## 100  406
## 101  710
## 102  562
## 103  438
## 104  415
## 105  672
## 106  315
## 107  510
## 108  488
## 109  312
## 110  680
## 111  562
## 112  325
## 113  607
## 114  434
## 115  385
## 116  407
## 117  495
## 118  345
## 119  372
## 120  564
## 121  625
## 122  465
## 123  365
## 124  380
## 125  380
## 126  378
## 127  352
## 128  466
## 129  342
## 130  580
## 131  630
## 132  530
## 133  560
## 134  600
## 135  650
## 136  695
## 137  720
## 138  515
## 139  580
## 140  590
## 141  600
## 142  780
## 143  520
## 144  550
## 145  855
## 146  830
## 147  415
## 148  625
## 149  650
## 150  550
## 151  500
## 152  480
## 153  425
## 154  675
## 155  640
## 156  725
## 157  480
## 158  880
## 159  660
## 160  620
## 161  520
## 162  680
## 163  570
## 164  675
## 165  615
## 166  520
## 167  695
## 168  685
## 169  750
## 170  630
## 171  510
## 172  470
## 173  660
## 174  740
## 175  750
## 176  835
## 177  840
## 178  560

View data

# View the head of data
head(initial_data)
##   V1    V2   V3   V4   V5  V6   V7   V8   V9  V10  V11  V12  V13  V14
## 1  1 14.23 1.71   NA 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
## 2  1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
## 3  1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
## 4  1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
## 5  1 13.24 2.59 2.87   21 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93  735
## 6  1 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85 1450
# View the tail of data
tail(initial_data)
##     V1    V2   V3   V4   V5    V6   V7   V8   V9  V10  V11  V12  V13 V14
## 173  3 14.16 2.51 2.48   20    91 1.68 0.70 0.44 1.24  9.7 0.62 1.71 660
## 174  3 13.71 5.65 2.45 20.5    95 1.68 0.61 0.52 1.06  7.7 0.64 1.74 740
## 175  3 13.40 3.91 2.48   23   102 1.80 0.75 0.43 1.41  7.3 0.70 1.56 750
## 176  3 13.27 4.28 2.26   20   120 1.59 0.69 0.43 1.35 10.2 0.59 1.56 835
## 177  3 13.17 2.59 2.37   20 99999 1.65 0.68 0.53 1.46  9.3 0.60 1.62 840
## 178  3 14.13  4.1 2.74 24.5    96 2.05 0.76 0.56 1.35  9.2 0.61 1.60 560

Shape of the data

dim(initial_data)
## [1] 178  14

There are 178 rows and 14 columns in the data.

DATA PRE-PROCESSING

Data pre-processing involves the following:
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation

For this project, the focus will mainly be on data cleaning. The framework for data cleaning is:
1. Understand the data structure.
2. Validate the fields and values.
3. Interpret statistics.
4. Visualize the data.

These tasks are important before the data is used for model building and other requirements of the business or institution. This section covers the treatment of missing data, the treatment of outliers, and the plotting of graphs with statistical analysis.

Summary of data types and variable names.

str(initial_data)
## 'data.frame':    178 obs. of  14 variables:
##  $ V1 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ V2 : num  14.2 13.2 13.2 14.4 13.2 ...
##  $ V3 : chr  "1.71" "1.78" "2.36" "1.95" ...
##  $ V4 : num  NA 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
##  $ V5 : chr  "15.6" "11.2" "18.6" "16.8" ...
##  $ V6 : chr  "127" "100" "101" "113" ...
##  $ V7 : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
##  $ V8 : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
##  $ V9 : num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
##  $ V10: chr  "2.29" "1.28" "2.81" "2.18" ...
##  $ V11: num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
##  $ V12: num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
##  $ V13: num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
##  $ V14: int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

This section gives a view of the data types in the data set. A summary of the data types is below:
i. Integer (int) - 2
ii. Character - 4
iii. Numeric - 8
Total - 14

The variables also came with default names, which were changed to the actual names provided in the additional text file.
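
The type counts can also be tallied in code instead of being read off the str() output. The following is a minimal sketch, assuming a small stand-in data frame (toy_data, a hypothetical name) in place of initial_data:

```r
# Hypothetical stand-in for initial_data: one integer, one numeric,
# and one character column.
toy_data <- data.frame(
  V1 = c(1L, 1L, 2L),
  V2 = c(14.23, 13.20, 12.37),
  V3 = c("1.71", ".", "na"),
  stringsAsFactors = FALSE
)

# Storage class of each column, then the number of columns per class.
col_types <- sapply(toy_data, class)
type_counts <- table(col_types)
print(type_counts)
```

Applied to initial_data, the same two lines should reproduce the counts listed above.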

Statistical summary of data

summary(initial_data)
##        V1              V2             V3                  V4       
##  Min.   :1.000   Min.   :11.03   Length:178         Min.   :1.360  
##  1st Qu.:1.000   1st Qu.:12.36   Class :character   1st Qu.:2.210  
##  Median :2.000   Median :13.05   Mode  :character   Median :2.360  
##  Mean   :1.938   Mean   :13.00                      Mean   :2.365  
##  3rd Qu.:3.000   3rd Qu.:13.68                      3rd Qu.:2.555  
##  Max.   :3.000   Max.   :14.83                      Max.   :3.230  
##                                                     NA's   :3      
##       V5                 V6                  V7              V8       
##  Length:178         Length:178         Min.   :0.980   Min.   :0.340  
##  Class :character   Class :character   1st Qu.:1.740   1st Qu.:1.200  
##  Mode  :character   Mode  :character   Median :2.350   Median :2.130  
##                                        Mean   :2.294   Mean   :2.022  
##                                        3rd Qu.:2.800   3rd Qu.:2.860  
##                                        Max.   :3.880   Max.   :5.080  
##                                        NA's   :1       NA's   :1      
##        V9             V10                 V11               V12        
##  Min.   :0.1300   Length:178         Min.   :      1   Min.   :0.4800  
##  1st Qu.:0.2700   Class :character   1st Qu.:      3   1st Qu.:0.7825  
##  Median :0.3400   Mode  :character   Median :      5   Median :0.9650  
##  Mean   :0.3627                      Mean   :  55623   Mean   :0.9574  
##  3rd Qu.:0.4400                      3rd Qu.:      6   3rd Qu.:1.1200  
##  Max.   :0.6600                      Max.   :9899999   Max.   :1.7100  
##  NA's   :1                                                             
##       V13             V14        
##  Min.   :1.270   Min.   : 278.0  
##  1st Qu.:1.930   1st Qu.: 500.5  
##  Median :2.780   Median : 673.5  
##  Mean   :2.608   Mean   : 746.9  
##  3rd Qu.:3.170   3rd Qu.: 985.0  
##  Max.   :4.000   Max.   :1680.0  
##  NA's   :1

Above is the statistical summary of the variables in the data set. It gives the mean, median, maximum, minimum, 1st quartile, and 3rd quartile for every numeric variable. The summary suggests that there could be outliers. All of the variables are meant to be numeric, so the character columns will be converted to numeric. The variable V11 shows a misleading value of 9899999.

Assign variable names to all the variables.

# Rename column names
colnames(initial_data) <- c("Alcohol", "Malic_acid", "Ash", "Alcalinity_of_ash", "Magnesium", "Total_phenols", "Flavanoids", "Nonflavanoid_phenols", "Proanthocyanins", "Color_intensity", "Hue", "12", "OD280_OD315_of_diluted_wines", "Proline")
# Display the column names.
colnames(initial_data)
##  [1] "Alcohol"                      "Malic_acid"                  
##  [3] "Ash"                          "Alcalinity_of_ash"           
##  [5] "Magnesium"                    "Total_phenols"               
##  [7] "Flavanoids"                   "Nonflavanoid_phenols"        
##  [9] "Proanthocyanins"              "Color_intensity"             
## [11] "Hue"                          "12"                          
## [13] "OD280_OD315_of_diluted_wines" "Proline"
summary(initial_data)
##     Alcohol        Malic_acid        Ash            Alcalinity_of_ash
##  Min.   :1.000   Min.   :11.03   Length:178         Min.   :1.360    
##  1st Qu.:1.000   1st Qu.:12.36   Class :character   1st Qu.:2.210    
##  Median :2.000   Median :13.05   Mode  :character   Median :2.360    
##  Mean   :1.938   Mean   :13.00                      Mean   :2.365    
##  3rd Qu.:3.000   3rd Qu.:13.68                      3rd Qu.:2.555    
##  Max.   :3.000   Max.   :14.83                      Max.   :3.230    
##                                                     NA's   :3        
##   Magnesium         Total_phenols        Flavanoids    Nonflavanoid_phenols
##  Length:178         Length:178         Min.   :0.980   Min.   :0.340       
##  Class :character   Class :character   1st Qu.:1.740   1st Qu.:1.200       
##  Mode  :character   Mode  :character   Median :2.350   Median :2.130       
##                                        Mean   :2.294   Mean   :2.022       
##                                        3rd Qu.:2.800   3rd Qu.:2.860       
##                                        Max.   :3.880   Max.   :5.080       
##                                        NA's   :1       NA's   :1           
##  Proanthocyanins  Color_intensity         Hue                12        
##  Min.   :0.1300   Length:178         Min.   :      1   Min.   :0.4800  
##  1st Qu.:0.2700   Class :character   1st Qu.:      3   1st Qu.:0.7825  
##  Median :0.3400   Mode  :character   Median :      5   Median :0.9650  
##  Mean   :0.3627                      Mean   :  55623   Mean   :0.9574  
##  3rd Qu.:0.4400                      3rd Qu.:      6   3rd Qu.:1.1200  
##  Max.   :0.6600                      Max.   :9899999   Max.   :1.7100  
##  NA's   :1                                                             
##  OD280_OD315_of_diluted_wines    Proline      
##  Min.   :1.270                Min.   : 278.0  
##  1st Qu.:1.930                1st Qu.: 500.5  
##  Median :2.780                Median : 673.5  
##  Mean   :2.608                Mean   : 746.9  
##  3rd Qu.:3.170                3rd Qu.: 985.0  
##  Max.   :4.000                Max.   :1680.0  
##  NA's   :1

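The summary above displays the new variable names, so the renaming was successful. Names are assigned by position: the first column, which holds the class labels 1 to 3, is named Alcohol, columns 2 to 11 each carry the name of the next attribute in the list given in the introduction, and column 12 receives the placeholder name 12. The names are used as assigned in the rest of this section.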
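
Because colnames() assigns names by position, a length check before renaming helps catch a mismatch between the number of names and the number of columns. The following is a minimal sketch on a hypothetical three-column data frame; toy_data and new_names are illustrative and not part of the analysis above:

```r
# Hypothetical three-column stand-in: class label followed by two attributes.
toy_data <- data.frame(V1 = c(1L, 2L), V2 = c(14.23, 12.37), V3 = c(1.71, 0.94))
new_names <- c("Class", "Alcohol", "Malic_acid")

# Stop early if the number of names does not match the number of columns.
stopifnot(length(new_names) == ncol(toy_data))

colnames(toy_data) <- new_names
print(colnames(toy_data))
```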

Change the data type from character to numeric

initial_data$Ash <- as.numeric(initial_data$Ash)
## Warning: NAs introduced by coercion
initial_data$Magnesium <- as.numeric(initial_data$Magnesium)
## Warning: NAs introduced by coercion
initial_data$Total_phenols <- as.integer(initial_data$Total_phenols)
## Warning: NAs introduced by coercion
initial_data$Color_intensity <- as.numeric(initial_data$Color_intensity)
## Warning: NAs introduced by coercion
str(initial_data)
## 'data.frame':    178 obs. of  14 variables:
##  $ Alcohol                     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Malic_acid                  : num  14.2 13.2 13.2 14.4 13.2 ...
##  $ Ash                         : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
##  $ Alcalinity_of_ash           : num  NA 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
##  $ Magnesium                   : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
##  $ Total_phenols               : int  127 100 101 113 118 112 96 121 97 98 ...
##  $ Flavanoids                  : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
##  $ Nonflavanoid_phenols        : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
##  $ Proanthocyanins             : num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
##  $ Color_intensity             : num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
##  $ Hue                         : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
##  $ 12                          : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
##  $ OD280_OD315_of_diluted_wines: num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
##  $ Proline                     : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

The output above confirms that the character columns have been converted to numeric.
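
The coercion warnings arise because tokens such as ".", "na", "NULL", and blank fields are not recognised as missing values when the file is read. One alternative, sketched here under the assumption that these are the only non-numeric tokens in the file, is to declare them through the na.strings argument of read.table so that the affected columns are read as numeric directly. The temporary file and its three rows are illustrative stand-ins for wine.txt.

```r
# Write a small illustrative file; the rows mimic tokens seen in the raw data.
tmp <- tempfile(fileext = ".txt")
writeLines(c("1,14.23,1.71,127",
             "1,13.82,.,111",
             "2,12.37,na,NULL"), tmp)

# Treat ".", "na", "NULL", and empty fields as NA at import time.
toy_data <- read.table(tmp, sep = ",", header = FALSE,
                       na.strings = c("NA", ".", "na", "NULL", ""),
                       strip.white = TRUE)
str(toy_data)
unlink(tmp)
```

With this approach the later as.numeric() calls are not needed, and the missing values can be counted immediately after import.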

Working on missing data.

# Checking for missing data
initial_data_miss <- sum(is.na(initial_data))
cat (".", "\n")
## .
cat ("Missing data in this data set is : ", initial_data_miss)
## Missing data in this data set is :  18
# Identify columns with missing data
missing_cols <- colnames(initial_data)[apply(is.na(initial_data), 2, any)]
missing_cols
## [1] "Ash"                          "Alcalinity_of_ash"           
## [3] "Magnesium"                    "Total_phenols"               
## [5] "Flavanoids"                   "Nonflavanoid_phenols"        
## [7] "Proanthocyanins"              "Color_intensity"             
## [9] "OD280_OD315_of_diluted_wines"

Above are the columns that have missing data.
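
It can also help to see how many values are missing in each column, not only which columns are affected. The following is a minimal sketch using colSums() on a hypothetical stand-in data frame (toy_data) in place of initial_data:

```r
# Hypothetical stand-in for initial_data after type conversion.
toy_data <- data.frame(
  Ash = c(1.71, NA, 2.36),
  Magnesium = c(15.6, 11.2, NA),
  Proline = c(1065L, 1050L, 1185L)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy_data))

# Keep only the columns that contain at least one missing value.
print(na_per_column[na_per_column > 0])
```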

Replacing the identified missing data in the data set.

# Replace the missing data in the integer variable 'Total_phenols' with the median of that column.
initial_data$Total_phenols[is.na(initial_data$Total_phenols)]<- median(initial_data$Total_phenols,na.rm = TRUE)
# Replace the missing data in each numeric variable with the mean of that column.
initial_data$Ash[is.na(initial_data$Ash)]<- mean(initial_data$Ash,na.rm = TRUE)
initial_data$Alcalinity_of_ash[is.na(initial_data$Alcalinity_of_ash)]<- mean(initial_data$Alcalinity_of_ash,na.rm = TRUE)
initial_data$Magnesium[is.na(initial_data$Magnesium)]<- mean(initial_data$Magnesium,na.rm = TRUE)
initial_data$Flavanoids[is.na(initial_data$Flavanoids)]<- mean(initial_data$Flavanoids,na.rm = TRUE)
initial_data$Nonflavanoid_phenols[is.na(initial_data$Nonflavanoid_phenols)]<- mean(initial_data$Nonflavanoid_phenols,na.rm = TRUE)
initial_data$Proanthocyanins[is.na(initial_data$Proanthocyanins)]<- mean(initial_data$Proanthocyanins,na.rm = TRUE)
initial_data$Color_intensity[is.na(initial_data$Color_intensity)]<- mean(initial_data$Color_intensity,na.rm = TRUE)
initial_data$OD280_OD315_of_diluted_wines[is.na(initial_data$OD280_OD315_of_diluted_wines)]<- mean(initial_data$OD280_OD315_of_diluted_wines,na.rm = TRUE)

The check below confirms that the missing data have been replaced with the column mean or median. The misleading value (9899999) will be handled during outlier removal.

# Checking for missing data
initial_data_miss <- sum(is.na(initial_data))
cat (".", "\n")
## .
cat ("Missing data in this data set is : ", initial_data_miss)
## Missing data in this data set is :  0
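The column-by-column replacement used above can also be expressed with a small helper function. This is only a sketch under stated assumptions: `impute_col` is a hypothetical helper and `toy` is made-up data, neither of which appears in the original analysis.

```r
# Hypothetical helper: fill the NAs of one column with a summary statistic
impute_col <- function(x, fun = mean) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

# Made-up data: one integer column and one numeric column
toy <- data.frame(
  Total_phenols = c(2L, NA, 4L, 9L),    # integer column -> median
  Ash           = c(2.0, 2.5, NA, 3.0)  # numeric column -> mean
)

toy$Total_phenols <- impute_col(toy$Total_phenols, median)
toy$Ash           <- impute_col(toy$Ash, mean)

# No missing values should remain after imputation
sum(is.na(toy))
```

Passing the statistic as an argument keeps the mean-versus-median choice visible at each call, which mirrors the rule used in this report.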

Identification of outliers and treatment

# Save the current graphics settings and show two plots per row
old_par = par(mfrow = c(1,2))
for ( i in 1:14 ) 
 {
  boxplot(initial_data[[i]], col = "green")
  mtext(names(initial_data)[i], cex = 0.8, side = 1, line = 2)
 }

# Restore the original graphics settings
par(old_par)

The plots above display boxplots for the 14 variables. The following variables contain outliers: 1. Ash 2. Alcalinity of ash 3. Magnesium 4. Total phenols 5. Color intensity 6. Hue 7. OD280/OD315 of diluted wines. There are no outliers in the target variable, Alcohol.

data_outliers = c()
for ( i in 1:14 ) 
  {
  stats = boxplot.stats(initial_data[[i]])$stats
  b_outlier_rows = which(initial_data[[i]] < stats[1])
  t_outlier_rows = which(initial_data[[i]] > stats[5])
  data_outliers = c(data_outliers , t_outlier_rows[ !t_outlier_rows %in% data_outliers ] )
  data_outliers = c(data_outliers , b_outlier_rows[ !b_outlier_rows %in% data_outliers ] )
}
cat("The outlier observations are:", "\n")
## The outlier observations are:
data_outliers
##  [1] 124 138 174  26 122  60  67 101  74 128   2  14  70  79  96 177 111 152 159
## [20] 160 172 116
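The loop above relies on `boxplot.stats()`, whose `$stats` element holds the lower whisker, lower hinge, median, upper hinge and upper whisker. With the default `coef = 1.5`, the whiskers reach the most extreme points that lie within 1.5 times the hinge spread of the box, so anything beyond them is flagged. A minimal sketch on made-up values (the vector `x` is an assumption for illustration):

```r
# Made-up values; 40 sits far above the rest
x <- c(10, 11, 12, 12, 13, 13, 14, 15, 40)

bs <- boxplot.stats(x)  # default coef = 1.5
bs$stats                # lower whisker, lower hinge, median, upper hinge, upper whisker

# Positions flagged the same way as in the loop above
outlier_rows <- which(x < bs$stats[1] | x > bs$stats[5])
outlier_rows            # position of the flagged value
x[outlier_rows]         # the flagged value itself
```

Here only the last value falls outside the whiskers, so it is the only one flagged.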

Application of Cook’s distance to detect influential observations.

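# Fit a linear model on all predictors and compute Cook's distance for each observation
mod_cook = lm(Alcohol ~ ., data = initial_data)
sd_1 = cooks.distance(mod_cook)
plot(sd_1, pch = "*", cex = 2, main = "Influential Obs by Cooks distance")
# Reference line at 4 times the mean Cook's distance
abline(h = 4*mean(sd_1, na.rm = TRUE), col = "red")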

Observations whose Cook’s distance exceeds the red line are treated as influential. They will be removed together with the boxplot outliers, which include the 1 misleading value.

# Row numbers of observations above 4 times the mean Cook's distance
c_outliers = as.numeric(rownames(initial_data[sd_1 > 4 * mean(sd_1, na.rm = TRUE), ]))
data_outliers = c(data_outliers , c_outliers[ !c_outliers %in% data_outliers ] )

# New data set without outliers, now called data.
data = initial_data[-data_outliers, ]
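The Cook's distance rule used above (flag anything above 4 times the mean distance) can be tried on any fitted linear model. The sketch below uses the built-in `mtcars` data as a stand-in, with `mpg` as an assumed response. It is not the wine model, and it selects rows by name rather than by row number.

```r
# mtcars ships with R; mpg plays the role of the response here
fit   <- lm(mpg ~ ., data = mtcars)
cooks <- cooks.distance(fit)

# Same rule of thumb as above: flag points above 4 times the mean distance
cutoff      <- 4 * mean(cooks, na.rm = TRUE)
influential <- names(cooks)[cooks > cutoff]

influential                                              # row names of the flagged cars
reduced <- mtcars[!rownames(mtcars) %in% influential, ]  # data without them
nrow(mtcars) - nrow(reduced)                             # number of rows dropped
```

Filtering with `%in%` instead of a negative index also behaves sensibly when nothing is flagged, because an empty `influential` simply leaves the data unchanged.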

The summary statistics below show the data after outlier removal. The data is now ready for additional visualizations.

# Print summary of new data.
summary(data)
##     Alcohol        Malic_acid         Ash        Alcalinity_of_ash
##  Min.   :1.000   Min.   :11.41   Min.   :0.740   Min.   :1.710    
##  1st Qu.:1.000   1st Qu.:12.37   1st Qu.:1.607   1st Qu.:2.237    
##  Median :2.000   Median :13.06   Median :1.875   Median :2.360    
##  Mean   :1.904   Mean   :13.04   Mean   :2.337   Mean   :2.373    
##  3rd Qu.:3.000   3rd Qu.:13.71   3rd Qu.:3.132   3rd Qu.:2.540    
##  Max.   :3.000   Max.   :14.83   Max.   :5.190   Max.   :2.920    
##    Magnesium     Total_phenols      Flavanoids    Nonflavanoid_phenols
##  Min.   :12.00   Min.   : 70.00   Min.   :0.980   Min.   :0.340       
##  1st Qu.:17.48   1st Qu.: 88.00   1st Qu.:1.715   1st Qu.:1.215       
##  Median :19.50   Median : 98.00   Median :2.310   Median :2.100       
##  Mean   :19.44   Mean   : 98.55   Mean   :2.284   Mean   :2.024       
##  3rd Qu.:21.12   3rd Qu.:106.00   3rd Qu.:2.800   3rd Qu.:2.885       
##  Max.   :27.00   Max.   :134.00   Max.   :3.880   Max.   :3.930       
##  Proanthocyanins  Color_intensity      Hue               12        
##  Min.   :0.1300   Min.   :0.410   Min.   : 1.280   Min.   :0.5400  
##  1st Qu.:0.2700   1st Qu.:1.235   1st Qu.: 3.250   1st Qu.:0.7975  
##  Median :0.3400   Median :1.535   Median : 4.750   Median :0.9600  
##  Mean   :0.3591   Mean   :1.539   Mean   : 5.002   Mean   :0.9577  
##  3rd Qu.:0.4300   3rd Qu.:1.870   3rd Qu.: 6.200   3rd Qu.:1.1125  
##  Max.   :0.6600   Max.   :2.960   Max.   :10.680   Max.   :1.4500  
##  OD280_OD315_of_diluted_wines    Proline      
##  Min.   :1.270                Min.   : 278.0  
##  1st Qu.:2.007                1st Qu.: 507.5  
##  Median :2.780                Median : 675.0  
##  Mean   :2.620                Mean   : 757.6  
##  3rd Qu.:3.170                3rd Qu.:1023.8  
##  Max.   :4.000                Max.   :1680.0
# Print dimension of data
dim(data)
## [1] 156  14

Exploratory data analysis

This section will deal with univariate, bivariate and multivariate analysis of the data set.

The diagrams below are initial histograms of the variables, each titled with the mean value of that variable.

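# Save the current graphics settings and show two histograms per row
dist_var = par(mfrow = c(1,2))
for ( i in 2:14 ) 
  {
  truehist(data[[i]], xlab = names(data)[i], col = 'lightgreen', main = paste("Average =", signif(mean(data[[i]]),3)))
  }
# Restore the original graphics settings
par(dist_var)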

Observations: 1. The following variables are skewed to the right: Ash, Total Phenols, Proanthocyanins, Hue and Proline. 2. The remaining variables appear skewed to the left in the histograms.
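A visual reading of skew can be cross-checked with a number. The sketch below computes the moment-based sample skewness, where a positive value indicates a longer right tail and a negative value a longer left tail. `skewness` is a hypothetical helper written for this illustration and the two vectors are made up.

```r
# Hypothetical helper: moment-based sample skewness (g1 = m3 / m2^1.5)
skewness <- function(x) {
  x  <- x[!is.na(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)   # second central moment
  m3 <- mean(d^3)   # third central moment
  m3 / m2^1.5
}

right_tail <- c(1, 2, 2, 3, 3, 3, 4, 20)  # made-up, long right tail
left_tail  <- -right_tail                 # mirror image, long left tail

skewness(right_tail)  # positive -> skewed to the right
skewness(left_tail)   # negative -> skewed to the left
```

On the cleaned wine data, something like `sapply(data[2:14], skewness)` could be used to check the observations above. No output is shown for that call because it was not part of the original analysis.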

A plot of the target (Alcohol) variable.

ggplot(initial_data, aes(x = Alcohol)) +
  geom_histogram(bins = 10, position = 'identity', alpha = 0.4, fill = "blue") +
  labs(title = "Histogram of Alcohol variable") +
  geom_text(aes(label = scales::percent(..count../sum(..count..))), stat = 'count', vjust = -0.5)

The target variable has 3 classes, distributed in the following percentages: i. Class 1 - 33.1% ii. Class 2 - 39.9% iii. Class 3 - 27%
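These percentages follow directly from the class counts given in the data description (59, 71 and 48 instances). A short sketch of the arithmetic, where the vector name `class_counts` is chosen here for illustration:

```r
# Class counts as listed in the data description
class_counts <- c("Class 1" = 59, "Class 2" = 71, "Class 3" = 48)

# Share of each class in percent, rounded to one decimal place
class_pct <- round(100 * prop.table(class_counts), 1)
class_pct
```

With 178 observations in total, this gives 33.1, 39.9 and 27.0 percent, matching the labels on the histogram.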

Display of pairplot of all variables.

pairs(data)

The pairplot above is too dense to interpret, so ggplot-based pair plots are created below for better visibility and clarity.

Converting the target variable Alcohol from integer to character data type so that it can be used as a grouping color in the plots.

data$Alcohol <- as.character(data$Alcohol)
ggpairs(data, columns = 2:5, aes(color = Alcohol, alpha = 0.5), upper = list(continuous = wrap("cor", size = 4)))

Observations: The data in these 4 variables are evenly distributed.

ggpairs(data, columns = 6:9, aes(color = Alcohol, alpha = 0.5), upper = list(continuous = wrap("cor", size = 4)))

Observations: The data in these 4 variables are not evenly distributed.

ggpairs(data, columns = 10:14, aes(color = Alcohol, alpha = 0.5), upper = list(continuous = wrap("cor", size = 3)))

Displaying correlation between the variables in the data set.

data$Alcohol <- as.integer(data$Alcohol)

cor(data)
##                                  Alcohol  Malic_acid         Ash
## Alcohol                       1.00000000 -0.36970312  0.45461931
## Malic_acid                   -0.36970312  1.00000000  0.10649076
## Ash                           0.45461931  0.10649076  1.00000000
## Alcalinity_of_ash            -0.06352005  0.21113024  0.17134163
## Magnesium                     0.56728900 -0.33315892  0.28469233
## Total_phenols                -0.25621139  0.42787825  0.02075724
## Flavanoids                   -0.74897566  0.32909832 -0.35142113
## Nonflavanoid_phenols         -0.87830665  0.30021648 -0.44515057
## Proanthocyanins               0.49596382 -0.19166875  0.29017899
## Color_intensity              -0.59988675  0.19889666 -0.23022086
## Hue                           0.19204566  0.56900577  0.31098169
## 12                           -0.62645754 -0.00936294 -0.58806273
## OD280_OD315_of_diluted_wines -0.78847145  0.11346656 -0.38854082
## Proline                      -0.64491405  0.66063816 -0.18176581
##                              Alcalinity_of_ash   Magnesium Total_phenols
## Alcohol                            -0.06352005  0.56728900 -0.2562113876
## Malic_acid                          0.21113024 -0.33315892  0.4278782486
## Ash                                 0.17134163  0.28469233  0.0207572412
## Alcalinity_of_ash                   1.00000000  0.30989743  0.4126787666
## Magnesium                           0.30989743  1.00000000 -0.2034730000
## Total_phenols                       0.41267877 -0.20347300  1.0000000000
## Flavanoids                          0.12237815 -0.43213847  0.2544180011
## Nonflavanoid_phenols                0.06652652 -0.47657034  0.2119999204
## Proanthocyanins                     0.07066865  0.32949206 -0.2566860344
## Color_intensity                     0.04340127 -0.30857546  0.1100978370
## Hue                                 0.21647217 -0.04890394  0.3564850589
## 12                                 -0.01583350 -0.31496548 -0.0007513791
## OD280_OD315_of_diluted_wines       -0.01990703 -0.35933498  0.0280598995
## Proline                             0.25482074 -0.47659792  0.4423740852
##                              Flavanoids Nonflavanoid_phenols Proanthocyanins
## Alcohol                      -0.7489757          -0.87830665      0.49596382
## Malic_acid                    0.3290983           0.30021648     -0.19166875
## Ash                          -0.3514211          -0.44515057      0.29017899
## Alcalinity_of_ash             0.1223781           0.06652652      0.07066865
## Magnesium                    -0.4321385          -0.47657034      0.32949206
## Total_phenols                 0.2544180           0.21199992     -0.25668603
## Flavanoids                    1.0000000           0.87292951     -0.49423764
## Nonflavanoid_phenols          0.8729295           1.00000000     -0.59595903
## Proanthocyanins              -0.4942376          -0.59595903      1.00000000
## Color_intensity               0.6390263           0.72581281     -0.43779612
## Hue                          -0.0233295          -0.13209335      0.07465519
## 12                            0.4573272           0.57833786     -0.25235424
## OD280_OD315_of_diluted_wines  0.6948129           0.77314613     -0.50695562
## Proline                       0.5273710           0.54113981     -0.31295733
##                              Color_intensity         Hue            12
## Alcohol                          -0.59988675  0.19204566 -0.6264575410
## Malic_acid                        0.19889666  0.56900577 -0.0093629402
## Ash                              -0.23022086  0.31098169 -0.5880627320
## Alcalinity_of_ash                 0.04340127  0.21647217 -0.0158335023
## Magnesium                        -0.30857546 -0.04890394 -0.3149654845
## Total_phenols                     0.11009784  0.35648506 -0.0007513791
## Flavanoids                        0.63902628 -0.02332950  0.4573272378
## Nonflavanoid_phenols              0.72581281 -0.13209335  0.5783378596
## Proanthocyanins                  -0.43779612  0.07465519 -0.2523542436
## Color_intensity                   1.00000000 -0.00690653  0.3316192431
## Hue                              -0.00690653  1.00000000 -0.4503957523
## 12                                0.33161924 -0.45039575  1.0000000000
## OD280_OD315_of_diluted_wines      0.59674385 -0.39936013  0.5604297231
## Proline                           0.38752662  0.39641180  0.2423489538
##                              OD280_OD315_of_diluted_wines    Proline
## Alcohol                                       -0.78847145 -0.6449141
## Malic_acid                                     0.11346656  0.6606382
## Ash                                           -0.38854082 -0.1817658
## Alcalinity_of_ash                             -0.01990703  0.2548207
## Magnesium                                     -0.35933498 -0.4765979
## Total_phenols                                  0.02805990  0.4423741
## Flavanoids                                     0.69481294  0.5273710
## Nonflavanoid_phenols                           0.77314613  0.5411398
## Proanthocyanins                               -0.50695562 -0.3129573
## Color_intensity                                0.59674385  0.3875266
## Hue                                           -0.39936013  0.3964118
## 12                                             0.56042972  0.2423490
## OD280_OD315_of_diluted_wines                   1.00000000  0.3133982
## Proline                                        0.31339821  1.0000000
corrplot(cor(data))

From the diagram above, the following variables have the strongest correlation with the target variable: 1. Flavanoids (-0.75) 2. Nonflavanoid_phenols (-0.88) 3. OD280_OD315_of_diluted_wines (-0.79) 4. Proline (-0.64). These variables could be considered for future data processing activities. All four of these correlations are negative, and several other pairs of variables are negatively correlated as well.
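Reading the strongest correlations off a large matrix is easier when one column is extracted and sorted by absolute value. A minimal sketch, using the built-in `mtcars` data as a stand-in and `mpg` as an assumed target (the object names are illustrative, not from the original analysis):

```r
# mtcars stands in for the cleaned data; "mpg" plays the role of the target
cm <- cor(mtcars)

# Correlation of every other variable with the target
target_cor <- cm[, "mpg"]
target_cor <- target_cor[names(target_cor) != "mpg"]

# Sort by absolute correlation, strongest first
ranked <- target_cor[order(abs(target_cor), decreasing = TRUE)]

head(ranked, 4)  # the four variables most strongly related to the target
```

For the wine data the equivalent would be to take the `"Alcohol"` column of `cor(data)` and rank it the same way.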

CONCLUSION

This project has been an extensive exercise in data analysis and visualization of the wine data provided. The following key observations were made during the data analysis: 1. Most of the time was spent on cleaning and visualizing the data. 2. The identified missing data were successfully treated by replacing them with the mean (numeric data type) or median (integer data type) of the variable. 3. Identified outliers were successfully treated or removed. 4. In terms of correlation, 4 key variables were identified as relevant and could be considered for processing in future. 5. For most of the variables, the mean and median were close to each other. After the extensive cleaning and analysis of these data, we can confirm that this wine data set is relevant and can be used for further research work and for building statistical or machine learning models.